lr 2
Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level
Khah, Saleh Vatan, Chezhegov, Savelii, Farahmand, Shahrokh, Horváth, Samuel, Gorbunov, Eduard
Stochastic first-order optimization methods, such as Stochastic Gradient Descent ( SGD) (Robbins and Monro, 1951), AdaGrad (Streeter and McMahan, 2010; Duchi et al., 2011), and Adam (Kingma and Ba, 2014), are fundamental for training modern Machine Learning (ML) and Deep Learning (DL) models. However, these methods are often enhanced with additional algorithmic techniques that play a critical role in their convergence and practical performance. Among these, gradient clipping (Pascanu et al., 2013) is one of the most widely used and well-studied approaches. In recent years, substantial efforts have been made to theoretically understand the advantages of gradient clipping and its impact on the convergence of stochastic optimization algorithms. In particular, gradient clipping is a key component in managing heavy-tailed noise, which commonly arises in the training of language models on textual data (Zhang et al., 2020b), in the training of GANs (Goodfellow et al., 2014; Gorbunov et al., 2022), and even in simpler tasks such as image classification (S im sekli et al., 2019). This approach is primarily analyzed through the lens of high-probability convergence, as such guarantees provide a more accurate reflection of the actual behavior of optimization methods compared to their more conventional in-expectation counterparts (Gorbunov et al., 2020).
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Russia (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems
Puchkin, Nikita, Suchkov, Denis, Naumov, Alexey, Belomestny, Denis
Modern methods of generative modelling and unpaired image-to-image translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired image-to-image translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.
Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget
The challenge of identifying the best feasible arm within a fixed budget has attracted considerable interest in recent years. However, a notable gap remains in the literature: the exact exponential rate at which the error probability approaches zero has yet to be established, even in the relatively simple setting of $K$-armed bandits with Gaussian noise. In this paper, we address this gap by examining the problem within the context of linear bandits. We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability. Remarkably, the decay rate -- characterized by the exponent -- matches the theoretical lower bound derived using information-theoretic principles. Our approach leverages a posterior sampling framework embedded within a game-based sampling rule involving a min-learner and a max-learner. This strategy shares its foundations with Thompson sampling, but is specifically tailored to optimize the identification process under fixed-budget constraints. Furthermore, we validate the effectiveness of our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity. The results corroborate our theoretical findings and demonstrate that our method outperforms several benchmark algorithms in terms of both accuracy and efficiency.
- Asia > Singapore > Central Region > Singapore (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
LR$^2$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Chen, Jianghao, Wei, Zhenlin, Ren, Zhenjiang, Li, Ziyong, Zhang, Jiajun
Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs. The leaderboard of our benchmark is available at https://huggingface.co/spaces/UltraRonin/LR2Bench
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- (3 more...)
- Leisure & Entertainment > Sports (0.68)
- Leisure & Entertainment > Games (0.48)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks
Fu, Hengyu, Wang, Zihao, Nichani, Eshaan, Lee, Jason D.
In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of \textit{multiple nonlinear features} using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \bp$, where $\bp:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents multiple quadratic features with $r \ll d$ and $g^{\star}:\mathbb{R}^{r}\rightarrow \mathbb{R}$ is a polynomial of degree $p$. This can be viewed as a nonlinear generalization of the multi-index model \citep{damian2022neural}, and also an expansion upon previous work that focused only on a single nonlinear feature, i.e. $r = 1$ \citep{nichani2023provable,wang2023learning}. Our primary contribution shows that a three-layer neural network trained via layerwise gradient descent suffices for \begin{itemize}\item complete recovery of the space spanned by the nonlinear features \item efficient learning of the target function $f^{\star}=g^{\star}\circ \bp$ or transfer learning of $f=g\circ \bp$ with a different link function \end{itemize} within $\widetilde{\cO}(d^4)$ samples and polynomial time. For such hierarchical targets, our result substantially improves the sample complexity ${\Theta}(d^{2p})$ of the kernel methods, demonstrating the power of efficient feature learning. It is important to highlight that{ our results leverage novel techniques and thus manage to go beyond all prior settings} such as single-index and multi-index models as well as models depending just on one nonlinear feature, contributing to a more comprehensive understanding of feature learning in deep learning.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > China (0.04)
Accelerated zero-order SGD under high-order smoothness and overparameterized regime
Bychkov, Georgii, Dvinskikh, Darina, Antsiferova, Anastasia, Gasnikov, Alexander, Lobanov, Aleksandr
We present a novel gradient-free algorithm to solve a convex stochastic optimization problem, such as those encountered in medicine, physics, and machine learning (e.g., adversarial multi-armed bandit problem), where the objective function can only be computed through numerical simulation, either as the result of a real experiment or as feedback given by the function evaluations from an adversary. Thus we suppose that only a black-box access to the function values of the objective is available, possibly corrupted by adversarial noise: deterministic or stochastic. The noisy setup can arise naturally from modeling randomness within a simulation or by computer discretization, or when exact values of function are forbidden due to privacy issues, or when solving non-convex problems as convex ones with an inexact function oracle. By exploiting higher-order smoothness, fulfilled, e.g., in logistic regression, we improve the performance of zero-order methods developed under the assumption of classical smoothness (or having a Lipschitz gradient). The proposed algorithm enjoys optimal oracle complexity and is designed under an overparameterization setup, i.e., when the number of model parameters is much larger than the size of the training dataset. Overparametrized models fit to the training data perfectly while also having good generalization and outperforming underparameterized models on unseen data. We provide convergence guarantees for the proposed algorithm under both types of noise. Moreover, we estimate the maximum permissible adversarial noise level that maintains the desired accuracy in the Euclidean setup, and then we extend our results to a non-Euclidean setup. Our theoretical results are verified on the logistic regression problem.
- Europe > Russia > Volga Federal District > Republic of Tatarstan (0.14)
- Asia > Russia (0.04)
- Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
- (2 more...)
Improved Rate of First Order Algorithms for Entropic Optimal Transport
Luo, Yiling, Xie, Yiling, Huo, Xiaoming
This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{{O}}({n^2}/{\epsilon})$ for solving OT.
Improved Regret Bounds for Online Submodular Maximization
Sadeghi, Omid, Raut, Prasanna, Fazel, Maryam
In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DR-submodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DR-submodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DR-submodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DR-submodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DR-submodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability. Experimental results demonstrate that our algorithms outperform the previous techniques in the aforementioned three settings.
- North America > United States > Washington > King County > Seattle (0.14)
- Europe > Spain > Canary Islands (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (3 more...)
Variance Reduction in Training Forecasting Models with Subgroup Sampling
Lu, Yucheng, Park, Youngsuk, Chen, Lifan, Wang, Yuyang, De Sa, Christopher, Foster, Dean
In real-world applications of large-scale time series, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large gradient variance, and thus requires long time training. To alleviate this issue, we propose a sampling strategy named Subgroup Sampling, which mitigates the large variance via sampling over pre-grouped time series. We further introduce SCott, a variance reduced SGD-style optimizer that co-designs subgroup sampling with the control variate method. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and show SCott converges faster with respect to both iterations and wall clock time. Additionally, we show two SCott variants that can speed up Adam and Adagrad without compromising generalization of forecasting models.
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
- Asia > Middle East > Jordan (0.04)
- Pacific Ocean > North Pacific Ocean > San Francisco Bay (0.04)
- (11 more...)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Online DR-Submodular Maximization with Stochastic Cumulative Constraints
Raut, Prasanna Sanjay, Sadeghi, Omid, Fazel, Maryam
In this paper, we consider online continuous DR-submodular maximization with linear stochastic long-term constraints. Compared to the prior work on online submodular maximization, our setting introduces the extra complication of stochastic linear constraint functions that are i.i.d. generated at each round. To be precise, at step $t\in\{1,\dots,T\}$, a DR-submodular utility function $f_t(\cdot)$ and a constraint vector $p_t$, i.i.d. generated from an unknown distribution with mean $p$, are revealed after committing to an action $x_t$ and we aim to maximize the overall utility while the expected cumulative resource consumption $\sum_{t=1}^T \langle p,x_t\rangle$ is below a fixed budget $B_T$. Stochastic long-term constraints arise naturally in applications where there is a limited budget or resource available and resource consumption at each step is governed by stochastically time-varying environments. We propose the Online Lagrangian Frank-Wolfe (OLFW) algorithm to solve this class of online problems. We analyze the performance of the OLFW algorithm and we obtain sub-linear regret bounds as well as sub-linear cumulative constraint violation bounds, both in expectation and with high probability.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- (10 more...)